#Data download and wrangling
library(curl)
library(dplyr)
library(tidyr)
#Machine Learning
library(mlr3verse)
library(mlr3spatiotempcv)
library(mlr3extralearners)
library(lightgbm)
#visualisation
library(ggplot2)
library(ggtext)
library(ggpmisc)
library(tmap)
library(gt)
library(scales)
#spatial
library(terra)
library(sf)
library(rstac)
library(gdalcubes)Introduction
This workbook outlines a minimal example, using the R programming language, for generating a canopy height model from Earth Observation (EO) data using Machine Learning. This workbook should be reproducible providing that the required packages are installed.
The workbook will demonstrate how to download training data from a Zenodo repository, generate a coincident Landsat 8 annual composite map and apply a short machine learning pipeline done locally in R with no need for cloud compute. However, the scale of the example is relatively small and so larger Areas of Interest (AOI) would benefit from greater computational power.
Let’s get into it…
Project set up
First let’s load the libraries we need for the analysis. If you don’t have these installed, you can run install.packages(<PACKAGE_NAME>) to install them.
Now let’s set a random seed number - this is important as we’re going to be using algorithms (actually just one in this example) that depend on random number generation; by setting this value we ensure that the results are reproducible
set.seed(5446) Training Data
Now let’s download some Tree Canopy Height (TCH) data generated as part of a biomass modelling paper by Asner et al. (2018). We’re going to use this as our “truth”. In reality, this is actually another model, created using Landsat data and Aerial LiDAR Survey (ALS) data from 2016. Therefore, this example is more like a model emulation rather than estimating actual measured canopy height; but hopefully a useful example - this could be as easily replaced with any other TCH data. It is quite a large file so might take a little while…
if (!dir.exists("data")) dir.create("data")
sabah_tch_path <- "data/GAO_TCH_30m_unmasked.tif"
#https://zenodo.org/record/4549461
# file is ~ 800MB so takes a little time.
if (!file.exists(sabah_tch_path)){
tch.url <- "https://zenodo.org/record/4549461/files/GAO_TCH_30m_unmasked.tif?download=1"
curl::curl_download(tch.url, destfile = sabah_tch_path, quiet=FALSE)
}Let’s focus in on a particular area… I’ve picked this region of Sabah at random as it looked like it had a nice diversity of canopy heights and landuse. So Let’s define a centre point of our Area of Interest (AOI) and buffer it by 10km - again this is arbitrary - and solely for this example. Then we crop the training TCH data that we downloaded in the last step to be coincident with our AOI.
proj.area <- sf::st_point(c(479911, 657243))|> # # sabah example
st_sfc(crs=32650) |> # for sabah
st_buffer(10000, endCapStyle = "SQUARE")
tch.proj <- terra::rast(sabah_tch_path) |>
terra::crop(proj.area)It’s always a good idea to visualise what you’re modelling - let’s take a look… Here we’re plotting the TCH data with the {tmap} package which uses a Javascript library called leaflet (in the background) to create an interactive map. FYI, if you find this type of interactive mapping useful you should also check out the {mapview} package which does much the same but trades of some of the flexibility of tmap for simpler syntax.
#' Generate a base tmap object to display some basemaps.
#'
#' @return A tmap object
tm_basic <- function() {
tmap::tm_basemap(tmap::providers$Esri.WorldImagery) +
tmap::tm_basemap(tmap::providers$OpenStreetMap.HOT) +
tmap::tm_basemap(tmap::providers$Stamen.Terrain) +
tmap::tm_basemap(tmap::providers$CartoDB.DarkMatter)
}
tmap::tmap_mode("view")
tm_basic() +
tmap::tm_shape(tch.proj, name = "Tree Canopy Height", raster.downsample = TRUE) +
tmap::tm_raster(
palette ="viridis",
style = "cont",
title = "Tree Canopy Height"
)Earth Observation Covariate Data
Now, there are many ways to explore and download Earth Observation (EO) data. Many of us use Google Earth Engine (GEE) as a powerful cloud based platform for data processing. But it can be challenging to work both locally and in GEE (although made much easier in R with the {rgee}). Further, not all of the processing we need can be achieved in GEE (although many things can and there are frequently alternatives). This is as much about showing that there are alternative and fully open source options for processing/working with EO data.
So, introducing Spatio Temporal Asset Catalogs (STAC). STAC is basically a way that data providers can store/index their geospatial data so that it can be easily searched, filtered and downloaded using a common Application Programming Language (API).
Landsat Composite
In this next code chunk we search the Microsoft Planetary Computer (MPC) STAC catalog for Landsat 8 imagery in 2016 for our AOI using the {rstac} package. We don’t have to use MPC, there are many other catalogs that host the same data but I have found MPC to be very performant.
# Conver our AOI to Lat Long bounding box
bbox <- st_bbox(tch.proj)
bbox_wgs84 <- bbox |>
st_as_sfc(crs=32650) |>
st_transform("EPSG:4326") |>
st_bbox()
# submit get request to STAC endpoint
s = stac("https://planetarycomputer.microsoft.com/api/stac/v1/")
col.list <- rstac::collections(s) |>
get_request()
# post the request with initial filtering parameters
items = s |>
stac_search(collections = "landsat-c2-l2",
bbox = c(bbox_wgs84["xmin"],bbox_wgs84["ymin"],
bbox_wgs84["xmax"],bbox_wgs84["ymax"]),
datetime ="2016-01-01T00:00:00Z/2016-12-30T00:00:00Z" )|>
post_request()|> items_sign(sign_fn = sign_planetary_computer())
length(items$features) # How many images do we have[1] 90
Note how there are 90 images returned in the initial search; This includes different sensors with varying cloud coverages. Let’s refine this firstly by stating the assests (bands) we want to download, specify the sensor we want (Landsat 8) and finally the percent of cloud cover we will accept as a maximum.
l8_collection = stac_image_collection(
items$features,
asset_names = c(
"qa_pixel",
"coastal" ,
"blue" ,
"green",
"red" ,
"nir08",
"swir16",
"swir22"
) ,
property_filter = function(x) {
x[["eo:cloud_cover"]] < 20 & x[["platform"]]=="landsat-8"
}
)
l8_collectionImage collection object, referencing 10 images with 8 bands
Images:
name left top bottom right
1 LC08_L2SP_117056_20160830_02_T1 116.4034 6.831066 4.729064 118.4616
2 LC08_L2SP_117056_20160814_02_T1 116.4170 6.831086 4.729034 118.4752
3 LC08_L2SP_117056_20160713_02_T1 116.4007 6.831066 4.729074 118.4562
4 LC08_L2SP_118056_20160704_02_T1 114.8509 6.834146 4.727284 116.9145
5 LC08_L2SP_118056_20160501_02_T1 114.8020 6.834136 4.727134 116.8631
6 LC08_L2SP_117056_20160408_02_T1 116.4007 6.831066 4.729074 118.4562
datetime srs
1 2016-08-30T02:26:10 EPSG:32650
2 2016-08-14T02:26:03 EPSG:32650
3 2016-07-13T02:25:56 EPSG:32650
4 2016-07-04T02:32:03 EPSG:32650
5 2016-05-01T02:31:47 EPSG:32650
6 2016-04-08T02:25:37 EPSG:32650
[ omitted 4 images ]
Bands:
name offset scale unit nodata image_count
1 blue 0 1 10
2 coastal 0 1 10
3 green 0 1 10
4 nir08 0 1 10
5 qa_pixel 0 1 10
6 red 0 1 10
7 swir16 0 1 10
8 swir22 0 1 10
Now we have reduced our cube to just 10 images, it’s time to create a composite. Here we use the {gdalcubes} package to create this, first we define the structure of the cube, i.e. the spatio temporal dimensions, coordinate reference system (CRS), temporal reduction method (here we use median) and the spatial resampling method.
e <- extent(l8_collection)
gdalcubes_options(parallel = parallel::detectCores()-4) # set the number of cores to use.
v = cube_view(srs=paste0("EPSG:",st_crs(proj.area)$epsg), dx=30, dy=30, dt="P1Y",
aggregation="median", resampling = "bilinear",
extent=
list(t0 = e$t0, t1 = e$t1,
left=bbox["xmin"], right=bbox["xmax"],
top=bbox["ymax"], bottom=bbox["ymin"])
)
vA data cube view object
Dimensions:
low high count pixel_size
t 2016-01-01 2016-12-31 1 P1Y
y 647250 667230 666 30
x 469920 489900 666 30
SRS: "EPSG:32650"
Temporal aggregation method: "median"
Spatial resampling method: "bilinear"
Now let’s download the data and apply cloud masking and some pixel maths in the process to generate an annual composite for 2016 with an additional band showing the Normalised Difference Vegetation Index (NDVI).
# L8.clear_mask = image_mask("qa_pixel", bits=1:4, values=16)
L8.clear_mask = image_mask("qa_pixel", values=21824, invert=TRUE)
cube <- raster_cube(l8_collection, v, mask = L8.clear_mask) |>
apply_pixel(c("(nir08-red)/(nir08+red)"), names="NDVI", keep_bands=TRUE)
if(!dir.exists("out_data")) dir.create("out_data")
terra_cube <-
write_tif(cube, dir = "data_out", prefix = "Sabah_example_") |> #
rast()
# Rescale the imagery but not NDVI that one is fine.
terra_cube[[setdiff(names(terra_cube), c("NDVI", "qa_pixel"))]] <-
terra_cube[[setdiff(names(terra_cube), c("NDVI", "qa_pixel"))]]/10000 What did we just download? Let’s take a look… You’ll see that to generate to RGB image in R, a little extra work is needed to get a visually recognisable image. You can use terra::plotRGB but the colours might look a bit “off”. tmap::tm_rgb can also be used but ggplot does offer a lot of control so it is useful to know.
#' Convert an RGB raster to a dataframe ready to plot with ggplot.
#'
#' @param r SpatRaster object that contains bands named red, green and blue.
#' @param .min Numeric between 0 and 1. The minimum cutoff value to visualise.
#' @param .max Numeric between 0 and 1. The maximum cutoff value to visualise.
#' @param .yr Adds a labelling column - in this case the year.
#'
#' @return A dataframe representing the values of the rgb raster with x and y locations.
l8_to_rgb_df <- function(r, .min=0.001, .max=1.2, .yr=2016){
as.data.frame(r, xy=TRUE) |>
mutate(red = case_when(red<.min ~ .min,
red>.max ~ .max,
TRUE ~ red),
red =scales::rescale(red, c(0,1)),
green = case_when(green<.min ~ .min,
green>.max ~ .max,
TRUE ~ green),
green = scales::rescale(green, c(0,1)),
blue = case_when(blue<.min ~ .min,
blue>.max ~ .max,
TRUE ~ blue),
blue = scales::rescale(blue, c(0,1)),
Year = .yr)
}
#' create an RGB raster map with {ggplot2}
#'
#' @param .x SpatRaster object that contains bands named red, green and blue.
#'
#' @return A ggplot
rgb_plot <- function(.x){
l8_df <- l8_to_rgb_df(.x[[c("red", "green", "blue")]]) |> # run the function to get a dataframe
tidyr::drop_na()
ggplot(data=l8_df, aes(x=x, y=y, fill=rgb(red,green,blue))) +
geom_raster() +
scale_fill_identity() +
theme_light() +
scale_x_continuous(breaks = round(seq(min(l8_df$x), max(l8_df$x), by = 5e+3),0)) +
scale_y_continuous(breaks = round(seq(min(l8_df$y), max(l8_df$y), by = 5e+3),0))+
theme(
axis.title.x=element_blank(),
axis.title.y=element_blank()) +
coord_fixed()
}
rgb_plot(terra_cube)Terrain Data
In addition to the spectral data from our Landsat 8 composite, we might also want to include other environmental covariates which could have some relationship with tree canopy height. So, let’s add some terrain data and related geomorphometrics. We again use rstac to retrieve the DTM tiles. Here we don’t need to use gdalcubes because the global DTM is assumed to represent a snapshot in time and therefore we do not have overlapping tiles. Here, terra (which uses gdal for reading data) is very capable of reading on-line GeoTiff files from urls.
#' Get the URLs for DTM tiles.
#'
#' @param aoi_box A bbox object (using EPDG:4326 CRS)
#' @param collection default "cop-dem-glo-30". The DTM collection to request.
#'
#' @return A list of DTM tile urls from the Microsoft Planetary Computer STAC catalog.
mpc_dtm_src <- function(aoi_box,
collection = "cop-dem-glo-30"){
s_obj <- rstac::stac("https://planetarycomputer.microsoft.com/api/stac/v1")
rstac::get_request(s_obj)
it_obj <- s_obj |>
rstac::stac_search(collections = collection[1],
bbox = c(aoi_box["xmin"],aoi_box["ymin"],
aoi_box["xmax"],aoi_box["ymax"])) |>
rstac::get_request()
src_list <-rstac::assets_url(it_obj)
.urls <- src_list[grep(".tif$", src_list)]
sapply(.urls, function(x) paste0("/vsicurl/", x), USE.NAMES =FALSE)
}
tch.reproj <- project(tch.proj, terra_cube)
dtm <- mpc_dtm_src(bbox_wgs84) |> # call the function to get the tile urls
lapply(rast) |> # iterate over the tiles and load as a spatRaster object
terra::sprc() |> # convert to a collection
terra::merge() |> # merge the tiles - basic mosaic - nothing overlaps here so this is fine.
project(tch.reproj) # project into our desired CRS and extent etc.
names(dtm) <- "dtm"
# now generate a few geomorphometric layers
asp <- terra::terrain(dtm, "aspect")
slp <- terrain(dtm, "slope")
TRI <- terrain(dtm, "TRIrmsd")
rough <- terrain(dtm, "roughness")Building The Model Inputs
Now, let’s combine the Terrain and Landsat 8 composite data into a single raster with all of the bands. Then we convert this to a dataframe so that we can apply the statistical models as we need.
# combine all of our layers from L8 and the terrain.
tch.reproj <- project(tch.proj, terra_cube)
comb_dat <- c(tch.reproj,terra_cube, dtm, asp, slp, TRI, rough)
drop_bands <- c("qa_pixel") # we don't want to model with pixel masks so drop it. Also dropping lwir11 as there are a lot of missing values and not useful for our needs.
comb_dat <- comb_dat[[setdiff(names(comb_dat), drop_bands)]]
names(comb_dat[[1]]) <- "TCH"
# this will be our dataframe for the ML section
comb_df <- as.data.frame(comb_dat, xy=TRUE) |>
tidyr::drop_na()Before we start modelling, what did we just download? Let’s take a look…Here are the individual bands, plotted using ggplot2 and patchwork. Again, a simple plot(comb_dat)will work OK but does not correctly display the data due to the differences in ranges across bands.
# plotting data in ggplot requires longform data.
comb_df |>
tidyr::pivot_longer(-one_of(c("x", "y"))) |> # converts df to longform.
dplyr::group_by(name) |> # groups by band
dplyr::group_map(~ggplot(.x) + # applies plot function by band
aes(x=x, y=y, fill=value)+
geom_raster() +
guides(fill="none") +
scale_fill_viridis_c() +
coord_fixed()+
theme_void() +
labs(subtitle=.x$name[1]) +
theme(
axis.title.x=element_blank(),
axis.title.y=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank()),
.keep=TRUE) |>
patchwork::wrap_plots(nrow=3) # combines all plots into a gridMachine Learning with {mlr3}
Now let’s do some statistics/machine learning - so much jargon in this space! we’re going to use the {mlr3} ecosystem to simplify our workflow - it has lots of functionality for applying different ML techniques and algorithms. This example is very basic but check out the mlr3 book for loads of great infotmation. You may also want to consider the tidymodels ecosystem which has very similar functionality to mlr3 but is stylistically very different.
The first step with mlr3 is to define the task this is essentially a template/ instruction set for the other mlr3 functions to work with.
# ---------- GENERATE A "TASK" ------------
task = mlr3spatiotempcv::TaskRegrST$new(
id = "Sabah_CHM",
backend = comb_df,
target = "TCH",
coordinate_names = c("x", "y"),
extra_args = list(
coords_as_features = FALSE,
crs = terra::crs(comb_dat))
)You’ll note that this task is a little bit special because we use the mlr3spatiotempcv package; essentially this just manages the additional spatial/temporal attributes of our data: coordinates, CRS and time (if we were doing multi-temporal work which we aren’t here) to manage resamapling strategies.
So, let’s define the spatial resampling method that we will use to evaluate our model’s performance. We can also use this for model tuning (although this is not included in this example). Note how training and test data partitions are spatially distinct. There are many different ways to do this spatial partitioning but this method is simple and often more than suitable. It uses kmeans clustering to split the AOI into six regions (based on the requested number of folds). It is important to do this to avoid over estimating our model’s performance by evaluating predictions with neighbouring or nearby cells which are more likely to be similar due to their spatial location.
# ---------- DEFINTE A RESAMPLING PLAN ------------
spcv_plan <- mlr3::rsmp("repeated_spcv_coords", folds = 6, repeats=1)
autoplot(spcv_plan, task = task, fold_id=1:3)Now we set some “learners” (AKA algorithms) - let’s test two in this case LightGBM and a Generalised Linear Model (GLM).
# ------- DEFINE SOME LEARNERS
lgbm.lrn <- mlr3::lrn("regr.lightgbm")
lrn.glm <- mlr3::lrn("regr.glm")Let’s benchmark these learners and see which will perform better:
design = benchmark_grid(task, list(lgbm.lrn, lrn.glm), spcv_plan)
# print(design)
future::plan("multisession", workers = 6) # sets the number of cores to run on - we have
bmr = mlr3::benchmark(design)INFO [10:34:36.982] [mlr3] Running benchmark with 12 resampling iterations
INFO [10:34:37.492] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 1/6)
INFO [10:34:38.027] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 2/6)
INFO [10:34:38.591] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 3/6)
INFO [10:34:39.197] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 4/6)
INFO [10:34:39.849] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 5/6)
INFO [10:34:40.676] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 6/6)
INFO [10:34:41.077] [mlr3] Applying learner 'regr.glm' on task 'Sabah_CHM' (iter 1/6)
INFO [10:34:41.867] [mlr3] Applying learner 'regr.glm' on task 'Sabah_CHM' (iter 2/6)
INFO [10:34:42.089] [mlr3] Applying learner 'regr.glm' on task 'Sabah_CHM' (iter 3/6)
INFO [10:34:42.547] [mlr3] Applying learner 'regr.glm' on task 'Sabah_CHM' (iter 4/6)
INFO [10:34:42.941] [mlr3] Applying learner 'regr.glm' on task 'Sabah_CHM' (iter 5/6)
INFO [10:34:43.113] [mlr3] Applying learner 'regr.glm' on task 'Sabah_CHM' (iter 6/6)
INFO [10:34:44.309] [mlr3] Finished benchmark
aggr = bmr$aggregate(measures = c(msr("regr.rmse"), msr("regr.mse"), msr("regr.rsq")))
gt::gt(aggr[,4:9,])| learner_id | resampling_id | iters | regr.rmse | regr.mse | regr.rsq |
|---|---|---|---|---|---|
| regr.lightgbm | repeated_spcv_coords | 6 | 3.801996 | 14.57025 | 0.5888335 |
| regr.glm | repeated_spcv_coords | 6 | 4.703977 | 22.27541 | 0.3745909 |
So the Lightgbm model looks to be the best - we could very well get a better performance out of all these models with some tuning and feature selection but, for now, let’s keep it simple. The result from the LightGBM also looks pretty good.
So, let’s check this in more detail by running the resampling and inspecting the results by plotting the predicted sets from the resampling:
resample_lgbm <- progressr::with_progress(expr = {
mlr3::resample(
task = task,
learner = lgbm.lrn,
resampling = spcv_plan,
store_models = FALSE,
encapsulate = "evaluate"
)
})INFO [10:34:44.928] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 1/6)
INFO [10:34:45.085] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 2/6)
INFO [10:34:45.237] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 3/6)
INFO [10:34:45.514] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 4/6)
INFO [10:34:45.767] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 5/6)
INFO [10:34:46.066] [mlr3] Applying learner 'regr.lightgbm' on task 'Sabah_CHM' (iter 6/6)
resample_lgbm$aggregate(measure = c(
mlr3::msr("regr.bias"),
mlr3::msr("regr.rmse"),
mlr3::msr("regr.mse")))regr.bias regr.rmse regr.mse
0.128907 3.879918 15.239274
resample_lgbm$prediction() |>
ggplot() +
aes(x=response, y=truth)+
geom_bin_2d(binwidth = 0.3) +
scale_fill_viridis_c(trans=scales::yj_trans(0.1), option="G", direction=-1) +
geom_abline(slope=1) +
theme_light()Great, so we have our chosen model, let’s predict across our full cube using all available data - we’ve done our accuracy evaluation so it makes no sense to have any data held out for the final model. Training and predicting the model is just two lines of code:
#---------- TRAIN THE MODEL -----------
lgbm.lrn$train(task)
#----------- PREDICT THE FULL MODEL --------
p <- terra::predict(comb_dat, lgbm.lrn, na.rm=FALSE)Now, remember that the Tree canopy height data we are emulating is a model itself, also, we are not totally sure what imagery was used to generate it. So, given this, take a look at the differences between the original TCH model and our LightGBM model results and consider why there are some fairly curious differences…
#----------- Let's compare with the training data --------
l8rgb <- terra_cube[[c("red", "green", "blue")]]
l8rgb.mat <- as.matrix(l8rgb) |>
scales::rescale()
values(l8rgb) <- l8rgb.mat
tm_basic() +
tm_shape(l8rgb, name="L8-rgb", raster.downsample = FALSE) +
tm_rgb(r= 1, g=2, b=3, max.value = 1)+
tmap::tm_shape(tch.reproj, name = "Tree Canopy Height", raster.downsample = TRUE) +
tmap::tm_raster(
palette ="viridis",
style = "cont",
title = "Tree Canopy Height",
breaks=c(seq(0, 50, 10))
) +
tmap::tm_shape(p, name = "Modelled Tree Canopy Height", raster.downsample = FALSE) +
tmap::tm_raster(
palette ="viridis",
style = "cont",
title = "Modelled Tree Canopy Height",
breaks=c(seq(0, 50, 10))
)Summary
We have downloaded some training data to use as an input to train our own machine learning model.
Then we downloaded Landsat 8 data and created a cloud masked annual median composite for 2016.
Next, we downloaded a dtm (Copernicus 30 m) and derived a range of geomorphometric covariates.
We combined all of these bands of data into a single raster, and then created a data frame comprising all of these values (along with our training TCH data).
Then, we tested two ML models using their default values, benchmarked their predictive performance and, based on this benchmark, selected the LightGBM model as having better performance.
We generated resampled predictions using spatially distinct test/train sets and visualised these results alongside generating aggregated model evaluation metrics.
Finally, we trained the model and predicted these values across our data cube to create our own TCH model. In prinicple we could now apply this model into new years and new regions, taking care to ensure that the new regions are reasonably described by the environmental covariates in this training process.
Further considerations
We have done no model tuning or feature selection. Both of these things can be carried out in mlr3 and are very likely to yield significant model improvements but will take a little longer to run.
Other ML algorithms may also yield good or even better results than presented here However, LightGBM is very fast and does often generate good results for these sorts ML projects.
We haven’t applied our model into novel regions or time periods - if we chose to do this, there are some important things to consider such as “area of applicability” and spectral consistency across annual composites. we also haven’t included any data that might more directly describe vegetations structure in this example, data such as Synthetic Aperture Radar (SAR) might help to improve model performance but SAR comes with many unique challenges.